####

Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.

Motivation


According to a recent report, overall tobacco use increased in youths (middle schooland high school students) in the United States in 2017 and 2018, despite previous years of declining use.

This major increase is attributed to an increase in the use of electronic cigaarette (e-cigarette) products.

E-cigarettes are referred to by many different names, including but not limited to:

  1. Electronic nicotine delivery systems (ENDS)
  2. Vapes
  3. e-hookahs
  4. vape pens
  5. tanks
  6. mods

The devices vary greatly:

[source]

See this CDC guide and the American Lung Association website for more information.

The report found that:

During 2017–2018, current use of any tobacco product increased 38.3% (from 19.6% to 27.1%) among high school students and 28.6% (from 5.6% to 7.2%) among middle school students; e-cigarette use increased 77.8% (from 11.7% to 20.8%) among high school students and 48.5% (from 3.3% to 4.9%) among middle school students.

In 2018, the Federal Drug Administration (FDA) in the United States stated that e-cigarette usage use among youth reached:

“nothing short of an epidemic proportion of growth

In this case study, we will be invistigating the same data used in the report that generated the above findings. This data comes from the The National Youth Tobacco Survey (NYTS).

Gentzke, Andrea S., Melisa Creamer, Karen A. Cullen, Bridget K. Ambrose, Gordon Willis, Ahmed Jamal, and Brian A. King. “Vital Signs: Tobacco Product Use Among Middle and High School Students - United States, 2011-2018.” MMWR. Morbidity and Mortality Weekly Report 68 (6): 157–64 (2019).

Main Questions


Our main question:

  1. How has tobacco/nicotine use by American youths changed since 2015?
  2. How do vaping rates compare between males and females?
  3. What vaping brands and flavors appear to be used the most frequently?
    We will base this on the following survey questions:
    > “During the past 30 days, what brand of e-cigarettes did you usually use?”
    >“What flavors of tobacco products have you used in the past 30 days?”

  4. Have vaping rates possibly influenced tobacco/nicotine use?

Learning Objectives


In this case study, we will cover how to import data from multiple files efficiently, how to import data from excel files, and how to make a variety of visualizations to compare multiple groups across time. We will also demonstrate how to work with codebooks. We will cover the concept of survey weighting and introduce the srvyr package. We will discuss the difference between pooled cross-sectional data and panel data. We will especially focus on using packages and functions from the Tidyverse for wrangling data, such as tidyr and dpyr and for visualization, such as as ggplot2. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.


We will begin by loading the packages that we will need:

Package Use
here to easily load and save data
readxl to import the data in the excel files
purrr to import the data in all the different excel and csv files efficiently
readr to import the CSV file data
dplyr to arrange/filter/select/compare specific subsets of the data
summarytools to get an overview of data in a different style
tidyr to rearrange data in wide and long formats

stringr | to manipulate the character strings within the data
ggplot2 | to make visualizations with multiple layers
ggpubr | to easily add regression line equations to plots
forcats | to change details about factors (categorical variables)
lmerTest| to perform linear mixed model testing
car| to perform Levene’s Test of Homogeneity of Variances
ggiraph| to make plots interactive
ggforce| to modify facets in plots
viridis| to plot in color palette
cowplot | to allow plots to be combined skimr | to get an overview of data

The first time we use a function, we will use the :: to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.

Context


According to the cited Morbidity and Mortality Weekly Report this was what was already known about this topic and the implications of this study:

Importantly, the vapors used in e-cigarettes contain harmful chemicals:

E-cigarette usage has also been associated with lung injury

See here for additonal information about the potential health effects of e-cigarettes in teens and young adults.

Limitations


There are some important considerations regarding this data analysis to keep in mind:

AVOCADO..Need Michael to look at these…

  1. The data included in the National Youth Tobacco Survey (NYTS) does not follow the same individual students over time. A longitudinal study that does follow the same individuals over time collects data called panel data. The data in this study is called pooled [cross-sectional data]https://en.wikipedia.org/wiki/Cross-sectional_data, this data is obtained from random collection of obervations across time.

According to wikipedia: >Panel data differs from pooled cross-sectional data across time, because it deals with the observations on the same subjects in different times whereas the latter observes different subjects in different time periods

AVOCADO revisit this… 2) The data also includes percentages of students that reported use of particular tobacco product, but the survey questions did not ask how much uses of one product compared to another - for example menthol flavored products where used at the near exclusion of other flavors.

While gender and sex are not actually binary, the data used in this analysis only contains information for groups of individuals who answered the survey questions as male or female.

AVOCADO.. should we drop the race data? we didnt use it… Furthermore, while the race of individual students was surveyed and asked students to self report based on what race or races they considered themselves to be, this list only included the following options:
1) Hispanic, Latino, Latina, or Spanish origin as one single race with the following further options:
1) Mexican, Mexican American, Chicano, or Chicana
2) Puerto Rican
3) Cuban
4) Yes, Another Hispanic, Latino, Latina, or Spanish origin
2)

What are the data?


The data in this case study comes from the National Youth Tobacco Survey (NYTS) which is an annual survey that asks students in high school and middle school (grades 6-12) about tobacco usage in the United States of America.

The data for this survey is freely available online at this website with data from 1999, 2000, 2002, 2004, 2006, 2009, and 2011-2019. We will be using data from 2015-2019 due to the fact that these years are the most recent that asked questions recarding e-cigarette usage.

Each year includes documentation, such as a codebook and an excel file containing the data:

Therefore, since we are using data from 2015-2019, the data we are interested in is located in 5 different excel files, each with their own codebook.

The codebook contains information describing the data within the excel file.

As you can see the excel file contains very short variable names and values, and it is not clear what they mean without the codebook:

The codebook explains what the variables (the columns) are:

And the codebook explains what the values for each variable are:

We will explain more later about what the values on the right indicate.

The reason that there are codebooks for each year is because the questions asked year varied slightly.

The data in this survey is what is called pooled cross-sectional data. In otherwords, different subsets of students are surveyed each year and it is not clear which, if any , individuals particpate from one year to the next.

Data Import


Since these excel files are so large, it takes a bit of time for the data to load. To make the process faster, we previously imported these files, selected only our questions of interest, and saved this data as csv files.

AVOCADO should we drop the race data? we didn’t end up doing anything with this….

Click here for details on how the data was originally imported

First we created a list of filenames of all the different excel files. Using the here() function of the here package, we looked in all the directories of the project. The list.files() function looked for all files with .xlsx within these subdirectories.

[1] "docs/2015-nyts-dataset-and-codebook-microsoft-excel/~$nyts2015.xlsx"
[2] "docs/2015-nyts-dataset-and-codebook-microsoft-excel/nyts2015.xlsx"  
[3] "docs/2016-nyts-dataset-and-codebook-microsoft-excel/nyts2016.xlsx"  
[4] "docs/2017-nyts-dataset-and-codebook-microsoft-excel/nyts2017.xlsx"  
[5] "docs/2018-nyts-dataset-and-codebook-microsoft-excel/nyts2018.xlsx"  
[6] "docs/2019-nyts-dataset-and-codebook-microsoft-excel/nyts2019.xlsx"  

All the files were read using read_excel of the readxl package.Using the map() function of the purrr package this was done efficiently. This created a single list of tibbles (one for each file).

Each excel file name was extracted using the str_extract() function of the stringr package.

[1] "nyts2015" "nyts2015" "nyts2016" "nyts2017" "nyts2018" "nyts2019"

These names became the names of the tibbles in the list of tibbles.

Specific columns were selected from each of the tibbles using the varaible name, as identified in the codebook for being of interest. In some cases functions like starts_with() of the dplyr package were used to select several variables.

tbl[["nyts2015"]] <- tbl[["nyts2015"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Qn1, #Age
                  Qn2, #Sex
                  Qn3, #Grade
                  starts_with("Qn4"), #Hispanic/Latino
                  starts_with("Qn5"), #Race,
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  EFLAVCIGTS,
                  CFLAVCIGTS,
                  EFLAVCIGAR,
                  )

tbl[["nyts2016"]] <- tbl[["nyts2016"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  EFLAVCIGAR,
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  ) 

tbl[["nyts2017"]] <- tbl[["nyts2017"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  CBIDIS,
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  )

tbl[["nyts2018"]] <- tbl[["nyts2018"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  )

tbl[["nyts2019"]] <- tbl[["nyts2019"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  EHTP,
                  CHTP,
                  Q40, #Brang, e-cigarettes
                  Q62A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q62B, #Clove or spice
                  Q62C, #Fruit 
                  Q62D, #Chocolate
                  Q62E, #Alcoholic Drink
                  Q62F, #Candy/Desserts/Other Sweets
                  Q62G, #Some Other Flavor Not Listed Here 
                  )

A directory was created using the base dir.create() function called data_reduced for the csv files. New csv files were created for each of the tbls in the list using the write_csv() function of the readr package. This was done all at once using the base mappy() function.

Now we will show how to read in the data from the five CSV files that were created from the five different excel files.

Reading in the CSV files

nyts_data[["nyts2016"]] <- nyts_data[["nyts2016"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  -EFLAVCIGAR,
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  ) %>%
    rename(Age=Q1,
           female=Q2,
           Grade=Q3,
           Not_HL=Q4A,
           HL_Mex=Q4B,
           HL_PR=Q4C,
           HL_Cub=Q4D,
           HL_Other=Q4E,
           Race_AIAN=Q5A,
           Race_Asian=Q5B,
           Race_BAA=Q5C,
           Race_NHOPI=Q5D,
           Race_White=Q5E,
           female=Q2,
           menthol=Q50A,
           clove_spice=Q50B,
           fruit=Q50C,
           chocolate=Q50D,
           alcoholic_drink=Q50E,
           candy_dessert_sweets=Q50F,
           other=Q50G,
           no_use=Q50H) %>%
    mutate(Age = as.numeric(Age) + 8,
           Grade = as.numeric(Grade) + 5,
           brand_ecig=NA) %>%
  dplyr::select(-starts_with("Q"))

# sapply(nyts_data[["nyts2016"]], function(x)
#     summary(
#         factor(x)
#         )
#     )

nyts_data[["nyts2016"]] <- nyts_data[["nyts2016"]] %>%
  mutate_all(~ replace(., . %in% c("*", "**"), NA)) %>%
  mutate(Age=as.character(Age),
         Grade=as.character(Grade)
         ) %>%
  mutate(Age=recode(Age,
                    `19` = ">18",
                    ),
         female=recode(female,
                      `1`= FALSE,
                      `2` = TRUE,
                      .default = NA,
                      .missing = NA),
         Grade=recode(Grade,
                      `13` = "Ungraded/Other"),
         Not_HL=recode(Not_HL,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)) %>%
  mutate_at(vars(starts_with("HL", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
  mutate_at(vars(starts_with("Race", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
    mutate_at(vars(starts_with("E", ignore.case = FALSE),
                   starts_with("C", ignore.case = FALSE)),
              list(~recode(.,
                           `1` = TRUE,
                           `2` = FALSE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
    mutate_at(vars(menthol:no_use),
              list(~recode(.,
                           `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)))

# sapply(nyts_data[["nyts2016"]], function(x)
#     summary(
#         factor(x)
#         )
#     )
nyts_data[["nyts2017"]] <- nyts_data[["nyts2017"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  CBIDIS,
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  ) %>%
    rename(Age=Q1,
           female=Q2,
           Grade=Q3,
           Not_HL=Q4A,
           HL_Mex=Q4B,
           HL_PR=Q4C,
           HL_Cub=Q4D,
           HL_Other=Q4E,
           Race_AIAN=Q5A,
           Race_Asian=Q5B,
           Race_BAA=Q5C,
           Race_NHOPI=Q5D,
           Race_White=Q5E,
           female=Q2,
           menthol=Q50A,
           clove_spice=Q50B,
           fruit=Q50C,
           chocolate=Q50D,
           alcoholic_drink=Q50E,
           candy_dessert_sweets=Q50F,
           other=Q50G,
           no_use=Q50H) %>%
    mutate(Age = as.numeric(Age) + 8,
           Grade = as.numeric(Grade) + 5,
           brand_ecig=NA) %>%
  dplyr::select(-starts_with("Q"))

# sapply(nyts_data[["nyts2017"]], function(x)
#     summary(
#         factor(x)
#         )
#     )

nyts_data[["nyts2017"]] <- nyts_data[["nyts2017"]] %>%
  mutate_all(~ replace(., . %in% c("*", "**"), NA)) %>%
  mutate(Age=as.character(Age),
         Grade=as.character(Grade)
         ) %>%
  mutate(Age=recode(Age,
                    `19` = ">18",
                    ),
         female=recode(female,
                      `1`= FALSE,
                      `2` = TRUE,
                      .default = NA,
                      .missing = NA),
         Grade=recode(Grade,
                      `13` = "Ungraded/Other"),
         Not_HL=recode(Not_HL,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)) %>%
  mutate_at(vars(starts_with("HL", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
  mutate_at(vars(starts_with("Race", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
    mutate_at(vars(starts_with("E", ignore.case = FALSE),
                   starts_with("C", ignore.case = FALSE)),
              list(~recode(.,
                           `1` = TRUE,
                           `2` = FALSE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
    mutate_at(vars(menthol:no_use),
              list(~recode(.,
                           `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)))

# sapply(nyts_data[["nyts2017"]], function(x)
#     summary(
#         factor(x)
#         )
#     )
nyts_data[["nyts2018"]] <- nyts_data[["nyts2018"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  ) %>%
    rename(Age=Q1,
           female=Q2,
           Grade=Q3,
           Not_HL=Q4A,
           HL_Mex=Q4B,
           HL_PR=Q4C,
           HL_Cub=Q4D,
           HL_Other=Q4E,
           Race_AIAN=Q5A,
           Race_Asian=Q5B,
           Race_BAA=Q5C,
           Race_NHOPI=Q5D,
           Race_White=Q5E,
           female=Q2,
           menthol=Q50A,
           clove_spice=Q50B,
           fruit=Q50C,
           chocolate=Q50D,
           alcoholic_drink=Q50E,
           candy_dessert_sweets=Q50F,
           other=Q50G,
           no_use=Q50H) %>%
    mutate(Age = as.numeric(Age) + 8,
           Grade = as.numeric(Grade) + 5,
           brand_ecig=NA) %>%
  dplyr::select(-starts_with("Q"))

# sapply(nyts_data[["nyts2018"]], function(x)
#     summary(
#         factor(x)
#         )
#     )

nyts_data[["nyts2018"]] <- nyts_data[["nyts2018"]] %>%
  mutate_all(~ replace(., . %in% c("*", "**"), NA)) %>%
  mutate(Age=as.character(Age),
         Grade=as.character(Grade)
         ) %>%
    mutate(Age=recode(Age,
                    `19` = ">18",
                    ),
         female=recode(female,
                      `1`= FALSE,
                      `2` = TRUE,
                      .default = NA,
                      .missing = NA),
         Grade=recode(Grade,
                      `13` = "Ungraded/Other"),
         Not_HL=recode(Not_HL,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)) %>%
  mutate_at(vars(starts_with("HL", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
  mutate_at(vars(starts_with("Race", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
  mutate_at(vars(starts_with("E", ignore.case = FALSE),
                   starts_with("C", ignore.case = FALSE)),
            list(~recode(.,
                         `1` = TRUE,
                         `2` = FALSE,
                         .missing = NA))) %>%
    mutate_at(vars(menthol:no_use),
              list(~recode(.,
                           `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)))

# sapply(nyts_data[["nyts2018"]], function(x)
#     summary(
#         factor(x)
#         )
#     )
nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  -EHTP,
                  -CHTP,
                  Q40, #Brang, e-cigarettes
                  Q62A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q62B, #Clove or spice
                  Q62C, #Fruit 
                  Q62D, #Chocolate
                  Q62E, #Alcoholic Drink
                  Q62F, #Candy/Desserts/Other Sweets
                  Q62G, #Some Other Flavor Not Listed Here 
                  )  %>%
    rename(brand_ecig=Q40,
           Age=Q1,
           female=Q2,
           Grade=Q3,
           Not_HL=Q4A,
           HL_Mex=Q4B,
           HL_PR=Q4C,
           HL_Cub=Q4D,
           HL_Other=Q4E,
           Race_AIAN=Q5A,
           Race_Asian=Q5B,
           Race_BAA=Q5C,
           Race_NHOPI=Q5D,
           Race_White=Q5E,
           female=Q2,
           menthol=Q62A,
           clove_spice=Q62B,
           fruit=Q62C,
           chocolate=Q62D,
           alcoholic_drink=Q62E,
           candy_dessert_sweets=Q62F,
           other=Q62G) %>%
    mutate(Age = as.numeric(Age) + 8,
           Grade = as.numeric(Grade) + 5,
           no_use="missing") %>%
  dplyr::select(-starts_with("Q"))

# sapply(nyts_data[["nyts2019"]], function(x)
#     summary(
#         factor(x)
#         )
#     )

nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
  mutate_all(~ replace(., . %in% c(".N",".S",".Z"), NA)) %>%
  mutate(Age=as.character(Age),
         Grade=as.character(Grade)
         ) %>%
  mutate(psu=as.character(psu),
         Age=recode(Age,
                    `19` = ">18",
                    ),
         female=recode(female,
                      `1`= FALSE,
                      `2` = TRUE,
                      .default = NA),
         Grade=recode(Grade,
                      `13` = "Ungraded/Other"),
         Not_HL=recode(Not_HL,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)) %>%
  mutate_at(vars(starts_with("HL", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
  mutate_at(vars(starts_with("Race", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
    mutate_at(vars(starts_with("E", ignore.case = FALSE),
                   starts_with("C", ignore.case = FALSE)),
              list(~recode(.,
                           `1` = TRUE,
                           `2` = FALSE,
                           .default = NA))) %>%
    mutate(brand_ecig = recode(brand_ecig,
                                             `1` = "Other", #levels 1,8 combined to `Other` 
                                             `2` = "Blu",
                                             `3` = "JUUL",
                                             `4` = "Logic",
                                             `5` = "MarkTen",
                                             `6` = "NJOY",
                                             `7` = "Vuse",
                                             `8` = "Other")) %>%
    mutate_at(vars(menthol:no_use),
              list(~recode(.,
                           `1` = TRUE,
                           .default = FALSE,
                           .missing =FALSE))) #Ask Michael about this if unclear

# sapply(nyts_data[["nyts2019"]], function(x)
#     summary(
#         factor(x)
#         )
#     )

Reminder: Current users are a subset of ever users.

Data Exploration and Wrangling


We will also use the %>% pipe which can be used to define the input for later sequential steps. This will make more sense when we have multiple sequential steps using the same data object. To use the pipe notation we need to install and load dplyr as well.

Data Visualization


Question 1

For lots of pivot_longer() that are the same… can use pivot_longer_spec()

spec <- relig_income %>% build_longer_spec( cols = -religion, names_to = “income”, values_to = “count” ) pivot_longer_spec(relig_income, spec)

But need to decide if we are going to create some new data frames…

Question 3

What vaping brands and flavors appear to be used the most frequently?

Huang J, Duan Z, Kwok J, et al. Tob Control 2019;28:146–151.

Huang J, Duan Z, Kwok J, et al. Tob Control 2019;28:146–151.

Paper

plot5 <- nyts_data %>%
  filter(year!=2015) %>%
  filter(menthol==TRUE|
           clove_spice==TRUE|
           fruit==TRUE|
           chocolate==TRUE|
           alcoholic_drink==TRUE|
           candy_dessert_sweets==TRUE|
           other==TRUE) %>%
  mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  mutate(ecig_only_ever = case_when(ecig_ever == TRUE &
                                      non_ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE),
           ecig_only_current = case_when(ecig_current == TRUE &
                                           non_ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE),
           non_ecig_only_ever = case_when(non_ecig_ever == TRUE &
                                            ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE),
           non_ecig_only_current = case_when(non_ecig_current == TRUE &
                                               ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE)) %>%
  mutate(Group = case_when(ecig_only_ever==TRUE |
                             ecig_only_current==TRUE ~ "Only e-cigarettes",
                         non_ecig_only_ever==TRUE |
                           non_ecig_only_current==TRUE ~ "Only other products",
                                    TRUE ~ "Both")) %>%
  filter(Group!="Both") %>%
  group_by(year, Group) %>%
  summarise(`Menthol`=(sum(menthol, na.rm = TRUE)*100)/
              sum(!is.na(menthol)),
              `Clove or Spice`=(sum(clove_spice, na.rm = TRUE)*100)/
              sum(!is.na(clove_spice)),
              `Fruit`=(sum(fruit, na.rm = TRUE)*100)/sum(!is.na(fruit)),
              `Chocolate`=(sum(chocolate, na.rm = TRUE)*100)/
              sum(!is.na(chocolate)),
              `Alcoholic Drink`=(sum(alcoholic_drink, na.rm = TRUE)*100)/
              sum(!is.na(alcoholic_drink)),
              `Candy/Desserts/Sweets`=(sum(candy_dessert_sweets, na.rm = TRUE)*100)/
              sum(!is.na(candy_dessert_sweets)),
            `Other`=(sum(other, na.rm = TRUE)*100)/
              sum(!is.na(other)),
            Respondents=n()) %>%
  #converting all columns between and including Menthol and Other to one column called Flavor
  pivot_longer(cols = Menthol:Other, names_to = "Flavor", values_to = "Percentage of students") %>%
  filter(!is.na(`Percentage of students`),
         Flavor!="Other") %>%
  group_by(Flavor) %>%
  mutate(affirmative=(Respondents * `Percentage of students`)/100) %>%
  mutate(flavor_mean = sum(affirmative)/sum(Respondents)) %>%
  ungroup() %>%
  mutate(flavor_mean_rank = dense_rank(flavor_mean),
         Flavor = fct_reorder(Flavor, flavor_mean_rank)) %>%
  ggplot(aes(x=year, y=`Percentage of students`, color=Group)) +
  facet_wrap(.~Flavor,ncol=3) +
  geom_line() + 
  geom_point(show.legend = FALSE) + 
  theme_minimal() +
  theme(legend.position = "bottom",
          axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90)) + 
  labs(title = "Among users of only one type of product, what vaping flavors appear to be used the most frequently?",
       subtitle = "Percent reporting only e-cigarette use vs only other nicotine product use")

plot5

Question 4

Have vaping rates possibly influenced tobacco/nicotine use?

plot7 <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
    group_by(year) %>%
    summarise(ecig_ever_year=(sum(ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(ecig_ever)),
              ecig_current_year=(sum(ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(ecig_current)),
              non_ecig_ever_year=(sum(non_ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_ever)),
              non_ecig_current_year=(sum(non_ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_current))) %>%
    pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    mutate(User = case_when(Category =="ecig_ever_year" ~ "Ever",
                           Category =="non_ecig_ever_year" ~ "Ever",
                           Category =="ecig_current_year" ~ "Current",
                           Category =="non_ecig_current_year" ~ "Current")) %>%
    mutate(Product = case_when(Category =="ecig_ever_year" ~ "E-cigarettes",
                           Category =="non_ecig_ever_year" ~ "Other products",
                           Category =="ecig_current_year" ~ "E-cigarettes",
                           Category =="non_ecig_current_year" ~ "Other products")) %>%
    filter(User=="Ever") %>%
    ggplot(aes(x=year,y=`Percentage of students`, color=Product)) +
    geom_line(linetype=1) + # geom_bar(stat="identity", position = "dodge", color="black") +
  geom_point(show.legend = FALSE) +
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does e-cigarette ever use compare to ever use of other products over the years?",
         subtitle = "E-cigarette and non-e-cigarette use",
         y = "% of students")

plot7

plot8 <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
    group_by(year) %>%
    summarise(ecig_ever_year=(sum(ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(ecig_ever)),
              ecig_current_year=(sum(ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(ecig_current)),
              non_ecig_ever_year=(sum(non_ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_ever)),
              non_ecig_current_year=(sum(non_ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_current))) %>%
    pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(User = case_when(Category =="ecig_ever_year" ~ "Ever",
                           Category =="non_ecig_ever_year" ~ "Ever",
                           Category =="ecig_current_year" ~ "Current",
                           Category =="non_ecig_current_year" ~ "Current")) %>%
    mutate(Product = case_when(Category =="ecig_ever_year" ~ "E-cigarettes",
                           Category =="non_ecig_ever_year" ~ "Other products",
                           Category =="ecig_current_year" ~ "E-cigarettes",
                           Category =="non_ecig_current_year" ~ "Other products")) %>%
    ggplot(aes(x=year,y=`Percentage of students`, color=Product, linetype=User)) +
    geom_line() +
  geom_point(show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does e-cigarette use compare to use of other products over the years?",
         subtitle = "E-cigarette and non-e-cigarette use",
         y = "% of students")

plot8

Weighted Sample

plotA_w <- nyts_data %>%
    mutate(tobacco_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE),
           tobacco_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
                                    tobacco_sum_ever ==0 ~ FALSE),
           tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
                                    tobacco_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
  summarise(tobacco_ever_year = survey_mean(tobacco_ever, vartype = "ci", na.rm=TRUE),
            tobacco_current_year = survey_mean(tobacco_current, vartype = "ci", na.rm=TRUE))  %>%
  mutate_at(vars(-year), "*", 100) %>%
    pivot_longer(cols = -year, names_to = "Type", values_to = "Percentage of students") %>%
    # gather(key=Type,
    #        value=`Percentage of students`,
    #        -year) %>%
  mutate(Estimate = case_when(grepl("_low", Type) ~ "Lower",
                          grepl("_upp", Type) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Type) ~ "Ever",
                          grepl("current", Type) ~ "Current",
                          TRUE ~ "Mean")) %>%
  dplyr::select(-Type) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  ggplot(aes(x=year,y=Mean)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
    scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "Nicotine product users more prevalent after 2017",
         y = "% of students")

plotB_w <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
    summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
  pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year)  %>%
  mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("current", Category) ~ "Current",
                          TRUE ~ "Ever",),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  filter(User=="Ever") %>%
  dplyr::rename("Lower_temp" = Upper,
                "Upper_temp" = Lower) %>%
  dplyr::rename("Lower"=Lower_temp,
                "Upper"=Upper_temp) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(linetype=1) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), show.legend = FALSE) +
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% ever trying e-cigarettes increases &\n% ever trying other products decreases",
         y = "% of students")

#### the wrangling looks the same as the above plot...
plotC_w <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
  summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
  pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% Using e-cigarettes increases &\n% using Other products decreases",
         y = "% of students")

title_w <- ggdraw() + 
  draw_label(
    expression("Have vaping rates possibly influenced tobacco/nicotine use?"),
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_w <- plot_grid(plotA_w,
                     rel_widths = c(1),
                     align = "v",
                     axis = "bt")
plotsBC_w <- plot_grid(plotB_w,
                     plotC_w,
                     rel_widths = c(1,1),
                     align = "v",
                     axis = "bt")

legend_w <- get_legend(plotB_w +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_w <- plot_grid(title_w,
                      plotsA_w,
                      plotsBC_w,
                      legend_w,
                      ncol = 1,
                      rel_heights = c(0.1,
                                      1,
                                      1,
                                      0.1),
                      scale = 1.0)

figure_w

Hypothethical Cohort

plotA_w_8 <- nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
    mutate(tobacco_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE),
           tobacco_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
                                    tobacco_sum_ever ==0 ~ FALSE),
           tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
                                    tobacco_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt) %>%
    group_by(year) %>%
  summarise(tobacco_ever_year = survey_mean(tobacco_ever, vartype = "ci", na.rm=TRUE),
            tobacco_current_year = survey_mean(tobacco_current, vartype = "ci", na.rm=TRUE))  %>%
  mutate_at(vars(-year), "*", 100) %>%
    pivot_longer(cols = -year, names_to = "Type", values_to = "Percentage of students")%>%
    # gather(key=Type,
    #        value=`Percentage of students`,
    #        -year) %>%
  mutate(Estimate = case_when(grepl("_low", Type) ~ "Lower",
                          grepl("_upp", Type) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Type) ~ "Ever",
                          grepl("current", Type) ~ "Current",
                          TRUE ~ "Mean")) %>%
  dplyr::select(-Type) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  ggplot(aes(x=year,y=Mean)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper)) + 
  scale_linetype_manual(values = c(2,1)) +
    scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "Nicotine product users becoming increasingly prevalent",
         y = "% of students")

plotB_w_8 <- nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt) %>%
    group_by(year) %>%
    summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
   pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students")%>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year)  %>%
  mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("current", Category) ~ "Current",
                          TRUE ~ "Ever",),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  filter(User=="Ever") %>%
  dplyr::rename("Lower_temp" = Upper,
                "Upper_temp" = Lower) %>%
  dplyr::rename("Lower"=Lower_temp,
                "Upper"=Upper_temp) %>%
  ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(linetype=1) +
  geom_linerange(aes(ymin = Lower, ymax = Upper)) + 
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% ever trying nicotine products increases",
         y = "% of students")

plotC_w_8 <- nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt) %>%
    group_by(year) %>%
  summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
  pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students")%>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper)) + 
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "E-cigarette use surpasses use of other nicotine products",
         y = "% of students")

title_w_8 <- ggdraw() + 
  draw_label(
    expression("Among"~8^th~"graders in 2015, have vaping rates possibly influenced tobacco/nicotine use?"),
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_w_8 <- plot_grid(plotA_w_8,
                        rel_widths = c(1),
                        align = "v",
                        axis = "bt")

plotsBC_w_8 <- plot_grid(plotB_w_8,
                         plotC_w_8,
                         rel_widths = c(1,1),
                         axis = "bt")

legend_w_8 <- get_legend(plotB_w_8 +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_w_8 <- plot_grid(title_w_8,
                        plotsA_w_8,
                        plotsBC_w_8,
                        legend_w_8,
                        ncol = 1,
                        rel_heights = c(0.1,
                                      1,
                                      1,
                                      0.1),
                        scale = 1.0
)

figure_w_8

Final Figure

title_final <- ggdraw() +
  draw_label(
    expression("Have vaping rates possibly influenced tobacco/nicotine use?"),
    fontface = 'bold',
    size=16,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_uw_final <- ggdraw() + 
  draw_label(
    expression(~6^th~"-"~12^th~"graders, unweighted"),
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_w_final <- ggdraw() + 
  draw_label(
    expression(~6^th~"-"~12^th~"graders, weighted"),
    fontface = 'bold',
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_w_8_final <- ggdraw() + 
  draw_label(
    expression(~8^th~"graders in 2015, weighted"),
    fontface = 'bold',
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_final <- plot_grid(subtitle_uw_final,
                            subtitle_w_final,
                            subtitle_w_8_final,
                            ncol = 3)

plotsA_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of e-cigarette use by user type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_final <- plot_grid(plotA_uw + theme(plot.title = element_blank()),
                          plotA_w + theme(plot.title = element_blank()),
                          plotA_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

plotsB_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of ever use by product type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsB_final <- plot_grid(plotB_uw + theme(plot.title = element_blank()),
                          plotB_w + theme(plot.title = element_blank()),
                          plotB_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

plotsC_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of nicotine product use by product & user type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsC_final <- plot_grid(plotC_uw + theme(plot.title = element_blank()),
                          plotC_w + theme(plot.title = element_blank()),
                          plotC_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

legend_final <- get_legend(plotB_w +
                             theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

final_plot <- plot_grid(title_final,
          plotsA_title_final,
          subtitle_final,
          plotsA_final,
          plotsB_title_final,
          subtitle_final,
          plotsB_final,
          plotsC_title_final,
          subtitle_final,
          plotsC_final,
          legend_final,
          ncol = 1,
          rel_heights = c(0.2,
                          0.2,
                          0.1,
                          1,
                          0.2,
                          0.1,
                          1,
                          0.2,
                          0.1,
                          1,
                          0.1))

final_plot

Suggested Homework


  • Apply survey weights to one of the figures produced in this case study in which weighted estimates were not produced. Include error bars in the updated figure.
    • Does the figure change after the application of survey weights?
    • If so, describe how.
  • Reproduce final_plot above for a different cohort of your choice.

Notes

Ever and current variables are limited to those shared by all years of data included in this case study.

New code: Knit time: 33.841 secs. Previous code: ~ 3 m.

Problems

I had difficulty producing a plot that succinctly presented a trend. It’s very easy to produce plots that are very useful once one is familiar with the data. Some plots, however, cannot stand alone and need additional context to be clear for those without prior knowledge of the data. When I first shared a plot I had been working on with others, it became clear that in my effort to present a complicated idea briefly I had left out information that would make the trend easily interpretable. To solve this issue, I began to present visualizations of the data alongside my original plot. The final figure I created contained several additional plots, each presenting the same trend at a different level than my initial plot.

My “centerpiece” plot is the middle plot in final_plot. The 8 plots around it help provide a very clear picture of what is going on in the US with regards to e-cigarette use and nicotine product use at large. On its own, it’s difficult to understand the trends in the US and how important the weighting scheme is for inference. Once you add the left and right columns, it’s clear what is going on.

Data Analysis


content header


Summary


Session info


R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] cowplot_1.0.0      ggplot2_3.3.1      srvyr_0.3.10       summarytools_0.9.6
 [5] readr_1.3.1        forcats_0.5.0      tidyr_1.1.0        stringr_1.4.0     
 [9] dplyr_1.0.0        purrr_0.3.4        readxl_1.3.1       knitr_1.28        
[13] here_0.1          

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0   xfun_0.14          pander_0.6.3       mitools_2.4       
 [5] splines_4.0.1      lattice_0.20-41    tcltk_4.0.1        colorspace_1.4-1  
 [9] vctrs_0.3.0        generics_0.0.2     htmltools_0.4.0    yaml_2.2.1        
[13] base64enc_0.1-3    survival_3.1-12    rlang_0.4.6        pillar_1.4.4      
[17] glue_1.4.1         withr_2.2.0        DBI_1.1.0          pryr_0.1.4        
[21] matrixStats_0.56.0 lifecycle_0.2.0    plyr_1.8.6         munsell_0.5.0     
[25] gtable_0.3.0       cellranger_1.1.0   codetools_0.2-16   evaluate_0.14     
[29] labeling_0.3       highr_0.8          Rcpp_1.0.4.6       backports_1.1.7   
[33] scales_1.1.1       checkmate_2.0.0    magick_2.3         farver_2.0.3      
[37] rapportools_1.0    hms_0.5.3          digest_0.6.25      stringi_1.4.6     
[41] survey_4.0         rprojroot_1.3-2    grid_4.0.1         tools_4.0.1       
[45] magrittr_1.5       tibble_3.0.1       crayon_1.3.4       pkgconfig_2.0.3   
[49] ellipsis_0.3.1     Matrix_1.2-18      lubridate_1.7.8    rmarkdown_2.2     
[53] R6_2.4.1           compiler_4.0.1    
---
title: "Open Case Studies : Vaping Behaviors in American Youth"
author: "Michael Ontiveros, Carrie Wright, PhD. "

css: style.css
output:
  html_document:
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes

---
<style>
#TOC {
  background: url("https://opencasestudies.github.io/img/logo.jpg");
  background-size: contain;
  padding-top: 240px !important;
  background-repeat: no-repeat;
}
</style>

```{r, echo=FALSE}
knit_time_start <- Sys.time()
```

```{r, echo=FALSE}
knitr::opts_chunk$set(fig.width=10, fig.height=8, dpi=300) 
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
                      message = FALSE, warning = FALSE, cache = FALSE,
                      fig.align = "center", out.width = '90%')
library(here)
library(knitr)
```


#### {.outline }
```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "final_plot.png"))
```
####




## {.disclaimer_block}

**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts. 

## **Motivation**
*** 
According to a recent [report](https://www.cdc.gov/mmwr/volumes/68/wr/mm6806e1.htm?s_cid=mm6806e1_w){target="_blank"}, overall tobacco use **increased** in youths (middle schooland high school students) in the United States in 2017 and 2018, despite previous years of declining use.

This major increase is attributed to an increase in the use of electronic cigaarette (e-cigarette) products.

E-cigarettes are referred to by many different names, including but not limited to:

1) Electronic nicotine delivery systems (ENDS)
2) Vapes
3) e-hookahs
4) vape pens
5) tanks
6) mods

The devices vary greatly:

```{r, echo = FALSE, fig.align ="center"}

include_graphics("https://www.lung.org/getmedia/8ac8ab8c-e7fc-497b-8384-441615f50ff0/ecigs_K.jpg.jpg")
```

##### [[source](https://www.lung.org/quit-smoking/e-cigarettes-vaping/lung-health)]

See this [CDC guide](https://www.cdc.gov/tobacco/basic_information/e-cigarettes/pdfs/ecigarette-or-vaping-products-visual-dictionary-508.pdf){target="_blank"} and the [American Lung Association website](https://www.lung.org/quit-smoking/e-cigarettes-vaping/lung-health){target="_blank"} for more information. 

The report found that:

> During 2017–2018, current use of any tobacco product **increased 38.3%** (from 19.6% to 27.1%) among high school students and **28.6%** (from 5.6% to 7.2%) among middle school students; e-cigarette use **increased 77.8%** (from 11.7% to 20.8%) among high school students and **48.5%** (from 3.3% to 4.9%) among middle school students.


In 2018, the [Federal Drug Administration (FDA) in the United States](https://acsjournals.onlinelibrary.wiley.com/doi/full/10.1002/cncr.31868){target="_blank"} stated that e-cigarette usage use among youth reached:  

> “nothing short of an **epidemic proportion of growth**”


In this case study, we will be invistigating the same data used in the report that generated the above findings. This data comes from the [The National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"}.

#### {.reference_block}

Gentzke, Andrea S., Melisa Creamer, Karen A. Cullen, Bridget K. Ambrose, Gordon Willis, Ahmed Jamal, and Brian A. King.  “Vital Signs: Tobacco Product Use Among Middle and High School Students - United States, 2011-2018.” **MMWR. Morbidity and Mortality Weekly Report** 68 (6): 157–64 (2019).

####


## **Main Questions**
*** 

#### {.main_question_block}
<b><u> Our main question: </u></b>

1) How has tobacco/nicotine use by American youths changed since 2015? 
2) How do vaping rates compare between males and females?
3) What vaping brands and flavors appear to be used the most frequently?  
We will base this on the following survey questions:   
> "During the past 30 days, what brand of e-cigarettes did you usually use?"   
>"What flavors of tobacco products have you used in the past
30 days?"  

4) Have vaping rates possibly influenced tobacco/nicotine use?

####


## **Learning Objectives** 
*** 

In this case study, we will cover how to import data from multiple files efficiently, how to import data from excel files, and how to make a variety of visualizations to compare multiple groups across time. We will also demonstrate how to work with codebooks. We will cover the concept of survey weighting and introduce the `srvyr` package. We will discuss the difference between pooled cross-sectional data and panel data. We will especially focus on using packages and functions from the [`Tidyverse`](https://www.tidyverse.org/){target="_blank"} for wrangling data, such as `tidyr` and `dpyr` and for visualization, such as as `ggplot2`. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.


```{r, out.width = "20%", echo = FALSE, fig.align ="center"}

include_graphics("https://tidyverse.tidyverse.org/logo.png")
```


*** 


We will begin by loading the packages that we will need:

```{r}
library(here)
library(readxl)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
library(forcats)
library(readr)
library(summarytools)
library(srvyr)
library(ggplot2)
library(cowplot)
```

 Package   | Use                                                                         
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data  
[readxl](https://readxl.tidyverse.org/){target="_blank"}      | to import the data in the excel files  
[purrr](https://purrr.tidyverse.org/){target="_blank"}   | to import the data in all the different excel and csv files efficiently
[readr](https://readr.tidyverse.org/){target="_blank"}      | to import the CSV file data
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to arrange/filter/select/compare specific subsets of the data  
[summarytools](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"}      | to get an overview of data in a different style   
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to rearrange data in wide and long formats 


[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the character strings within the data   
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}    | to make visualizations with multiple layers  
[ggpubr](https://cran.r-project.org/web/packages/ggpubr/index.html){target="_blank"}    | to easily add regression line equations to plots  
[forcats](https://forcats.tidyverse.org/){target="_blank"}    | to change details about factors (categorical variables)  
[lmerTest](https://cran.r-project.org/web/packages/lmerTest/lmerTest.pdf)| to perform linear mixed model testing   
[car](https://cran.r-project.org/web/packages/car/car.pdf)| to perform Levene's Test of Homogeneity of Variances   
[ggiraph](https://cran.r-project.org/web/packages/ggiraph/index.html)| to make plots interactive   
[ggforce](https://cran.r-project.org/web/packages/ggforce/ggforce.pdf)| to modify facets in plots  
[viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html)| to plot in color palette    
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined   [skimr](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"}      | to get an overview of data   




The first time we use a function, we will use the `::` to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.


## **Context**
*** 

According to the cited [Morbidity and Mortality Weekly Report](https://www.cdc.gov/mmwr/volumes/68/wr/mm6806e1.htm?s_cid=mm6806e1_w) this was what was already known about this topic and the implications of this study:

```{r, echo = FALSE, fig.align ="center", out.width = "800 px"}

knitr::include_graphics(here::here("img", "context.png"))

```


Importantly, the vapors used in e-cigarettes contain harmful chemicals:

```{r, echo = FALSE, fig.align ="center"}

include_graphics("https://www.cdc.gov/tobacco/basic_information/e-cigarettes/images/e-cigarette-aerosol-can-contain-harmful-ingredients-desktop-700.jpg")
```

E-cigarette usage has also been associated with [lung injury]((https://www.frontiersin.org/articles/10.3389/fphar.2019.01619/full)){target="_blank"}

```{r, echo = FALSE, fig.align ="center", out.width = "800 px"}

knitr::include_graphics(here::here("img", "lung.png"))
```

See [here](https://www.cdc.gov/tobacco/basic_information/e-cigarettes/Quick-Facts-on-the-Risks-of-E-cigarettes-for-Kids-Teens-and-Young-Adults.html){target="_blank"} for additonal information about the potential health effects of e-cigarettes in teens and young adults.

## **Limitations**
*** 
There are some important considerations regarding this data analysis to keep in mind: 

AVOCADO..Need Michael to look at these...

1) The data included in the [National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"} does not follow the same individual students over time.  A [longitudinal study](https://www.bmj.com/about-bmj/resources-readers/publications/epidemiology-uninitiated/7-longitudinal-studies){target="_blank"} that does follow the same individuals over time collects data called [panel data](https://en.wikipedia.org/wiki/Panel_data). The data in this study is called pooled [cross-sectional data]https://en.wikipedia.org/wiki/Cross-sectional_data, this data is obtained from random collection of obervations across time.

According to wikipedia:
>Panel data differs from pooled cross-sectional data across time, because it deals with the observations on the same subjects in different times whereas the latter observes different subjects in different time periods


AVOCADO revisit this...
2) The data also includes percentages of students that reported use of particular tobacco product, but the survey questions did not ask  how much uses of one product compared to another - for example menthol flavored products where used at the near exclusion of other flavors. 

While [gender](https://www.genderspectrum.org/quick-links/understanding-gender/){target="_blank"} and [sex](https://www.who.int/genomics/gender/en/index1.html){target="_blank"} are not actually binary, the data  used in this analysis only contains information for groups of individuals who answered the survey questions as male or female. 


AVOCADO.. should we drop the race data? we didnt use it...
Furthermore, while the race of individual students was surveyed and asked students to self report based on what race or races they considered themselves to be, this list only included the following options:    
1) Hispanic, Latino, Latina, or Spanish origin as one single race with the following further options:  
    1) Mexican, Mexican American, Chicano, or Chicana  
    2) Puerto Rican  
    3) Cuban  
    4) Yes, Another Hispanic, Latino, Latina, or Spanish origin  
2)


## **What are the data?**
*** 
 
The data in this case study comes from the [National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"} which is an annual survey that asks students  in high school and middle school (grades 6-12) about tobacco usage in the United States of America.

The data for this survey is freely available online at this [website](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/data/index.html){target="_blank"} with data from 1999, 2000, 2002, 2004, 2006, 2009,  and 2011-2019. We will be using data from **2015-2019** due to the fact that these years are the most recent that asked questions recarding e-cigarette usage.

Each year includes documentation, such as a [codebook](https://www.icpsr.umich.edu/icpsrweb/content/shared/ICPSR/faqs/what-is-a-codebook.html) and an excel file containing the data:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "data.png"))
```
Therefore, since we are using data from **2015-2019**, the data we are interested in is located in 5 different excel files, each with their own [codebook](https://www.icpsr.umich.edu/icpsrweb/content/shared/ICPSR/faqs/what-is-a-codebook.html).

The [codebook](https://www.icpsr.umich.edu/icpsrweb/content/shared/ICPSR/faqs/what-is-a-codebook.html) contains information describing the data within the excel file. 

As you can see the excel file contains very short variable names and values, and it is not clear what they mean without the codebook:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "excel.png"))
```

The codebook explains what the variables (the columns) are:
```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "variables.png"))
```

And the codebook explains what the values for each variable are:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "qn1.png"))
```

We will explain more later about what the values on the right indicate.

The reason that there are codebooks for each year is because the questions asked year varied slightly.


The data in this survey is what is called pooled cross-sectional data. In otherwords, different subsets of students are surveyed each year and it is not clear which, if any , individuals particpate from one year to the next.

## **Data Import**
*** 
Since these excel files are so large, it takes a bit of time for the data to load. To make the process faster, we previously imported these files, selected only our questions of interest, and saved this data as csv files. 

AVOCADO should we drop the race data? we didn't end up doing anything with this....

<details><summary> Click here for details on how the data was originally imported </summary>

First we created a list of filenames of all the different excel files. Using the `here()` function of the `here` package, we looked in all the directories of the project.
The `list.files()` function looked for all files with .xlsx within these subdirectories.
```{r}
excel_files<-list.files(here::here(), recursive = TRUE,
                  pattern = "*.xlsx")
excel_files
```

All the files were read using read_excel of the readxl package.Using the map() function of the purrr package this was done efficiently.
This created a single list of tibbles (one for each file). 
```{r, eval = FALSE}
tbl <- excel_files %>% 
       map(~readxl::read_excel(.))
```

Each excel file name was extracted using the str_extract() function of the stringr package.
```{r}
tbl_names <- excel_files %>%
  str_extract("nyts[2][0][1][5-9]")
tbl_names
```

These names became the names of the tibbles in the list of tibbles.
```{r, eval = FALSE}
names(tbl) <- tbl_names
```


Specific columns were selected from each of the tibbles using the varaible name, as identified in the codebook for being of interest.
In some cases functions like starts_with() of the dplyr package were used to select several variables.

```{r, eval = FALSE}

tbl[["nyts2015"]] <- tbl[["nyts2015"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Qn1, #Age
                  Qn2, #Sex
                  Qn3, #Grade
                  starts_with("Qn4"), #Hispanic/Latino
                  starts_with("Qn5"), #Race,
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  EFLAVCIGTS,
                  CFLAVCIGTS,
                  EFLAVCIGAR,
                  )

tbl[["nyts2016"]] <- tbl[["nyts2016"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  EFLAVCIGAR,
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  ) 

tbl[["nyts2017"]] <- tbl[["nyts2017"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  CBIDIS,
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  )

tbl[["nyts2018"]] <- tbl[["nyts2018"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  )

tbl[["nyts2019"]] <- tbl[["nyts2019"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  EHTP,
                  CHTP,
                  Q40, #Brang, e-cigarettes
                  Q62A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q62B, #Clove or spice
                  Q62C, #Fruit 
                  Q62D, #Chocolate
                  Q62E, #Alcoholic Drink
                  Q62F, #Candy/Desserts/Other Sweets
                  Q62G, #Some Other Flavor Not Listed Here 
                  )
```

A directory was created using the base `dir.create()` function called `data_reduced` for the csv files. 
New csv files were created for each of the tbls in the list using the `write_csv()` function of the `readr` package.
This was done all at once using the base `mappy()` function.

```{r, eval = FALSE}
# 
dir.create("docs/data_reduced")

mapply(write_csv, tbl, path=here(paste0("docs/data_reduced/",
                                  names(tbl),
                                   '.csv'))
     )
# tbl %>%
#   names(.) %>%
# map(~write_csv(tibble(.) , path=paste0("docs/data_reduced2/",., ".csv")))
#                                   
```

</details>



Now we will show how to read in the data from the five CSV files that were created from the five different excel files.

### Reading in the CSV files

```{r, echo=FALSE, include=FALSE}
start_time <- Sys.time()

end_time <- Sys.time()

test_time <- end_time - start_time

time_message <- paste("Duration of data import:",
      round(as.numeric(test_time) ,3),
      units(test_time)
      )
```


```{r}
csvs <- list.files(here::here(),recursive = TRUE,
                   pattern = "*.csv")%>%
  map(~read_csv(.))


csvs_names <- list.files(recursive = TRUE,
                         pattern = "*.csv")%>%
  str_extract("nyts[2][0][1][5-9]")

names(csvs) <- csvs_names

nyts_data <- csvs
rm(csvs)
rm(csvs_names)
```


```{r}

nyts_data[["nyts2015"]] <- nyts_data[["nyts2015"]] %>%
    # dplyr::select(psu,
    #               finwgt,
    #               stratum,
    #               Qn1, #Age
    #               Qn2, #Sex
    #               Qn3, #Grade
    #               starts_with("Qn4"), #Hispanic/Latino
    #               starts_with("Qn5"), #Race
    #               starts_with("E",
    #                           ignore.case = FALSE),
    #               starts_with("C",
    #                           ignore.case = FALSE),
    #               -EFLAVCIGTS,
    #               -CFLAVCIGTS,
    #               -EFLAVCIGAR,
    #               ) %>%
    rename(Age=Qn1,
           female=Qn2,
           Grade=Qn3,
           Not_HL=Qn4a,
           HL_Mex=Qn4b,
           HL_PR=Qn4c,
           HL_Cub=Qn4d,
           HL_Other=Qn4e,
           Race_AIAN=Qn5a,
           Race_Asian=Qn5b,
           Race_BAA=Qn5c,
           Race_NHOPI=Qn5d,
           Race_White=Qn5e) %>%
    mutate(Age=Age+8,
           Grade=Grade+5,
           brand_ecig=NA,
           menthol=NA,
           clove_spice=NA,
           fruit=NA,
           chocolate=NA,
           alcoholic_drink=NA,
           candy_dessert_sweets=NA,
           other=NA,
           no_use=NA) %>%
  dplyr::select(-starts_with("Q"))
```


#### {.scrollable }
```{r}
# Scroll through the output!


```
####

```{r, eval = FALSE}
sapply(nyts_data[["nyts2015"]], function(x)
    summary(
        factor(x)
        )
    )
```

```{r}
#Note about difference between recode and fct_recode
nyts_data[["nyts2015"]] <- nyts_data[["nyts2015"]] %>%
  mutate_all(~ replace(., . %in% c("."), NA)) %>%
  mutate(Age=as.character(Age),
         Grade=as.character(Grade)
         ) %>%
  mutate(Age=recode(Age,
                    `19` = ">18",
                    ),
         female=recode(female,
                      `1`= FALSE,
                      `2` = TRUE,
                      .default = NA,
                      .missing = NA),
         Grade=recode(Grade,
                      `13` = "Ungraded/Other"),
         Not_HL=recode(Not_HL,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)) %>%
  mutate_at(vars(starts_with("HL", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
  mutate_at(vars(starts_with("Race", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
  mutate_at(vars(starts_with("E", ignore.case = FALSE),
                   starts_with("C", ignore.case = FALSE)),
              list(~recode(.,
                           `1` = TRUE,
                           `2` = FALSE,
                           .default = NA,
                      .missing = NA)))
```

```{r, eval = FALSE}
sapply(nyts_data[["nyts2015"]], function(x)
    summary(
        factor(x)
        )
    )
```

```{r}
nyts_data[["nyts2016"]] <- nyts_data[["nyts2016"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  -EFLAVCIGAR,
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  ) %>%
    rename(Age=Q1,
           female=Q2,
           Grade=Q3,
           Not_HL=Q4A,
           HL_Mex=Q4B,
           HL_PR=Q4C,
           HL_Cub=Q4D,
           HL_Other=Q4E,
           Race_AIAN=Q5A,
           Race_Asian=Q5B,
           Race_BAA=Q5C,
           Race_NHOPI=Q5D,
           Race_White=Q5E,
           female=Q2,
           menthol=Q50A,
           clove_spice=Q50B,
           fruit=Q50C,
           chocolate=Q50D,
           alcoholic_drink=Q50E,
           candy_dessert_sweets=Q50F,
           other=Q50G,
           no_use=Q50H) %>%
    mutate(Age = as.numeric(Age) + 8,
           Grade = as.numeric(Grade) + 5,
           brand_ecig=NA) %>%
  dplyr::select(-starts_with("Q"))

# sapply(nyts_data[["nyts2016"]], function(x)
#     summary(
#         factor(x)
#         )
#     )

nyts_data[["nyts2016"]] <- nyts_data[["nyts2016"]] %>%
  mutate_all(~ replace(., . %in% c("*", "**"), NA)) %>%
  mutate(Age=as.character(Age),
         Grade=as.character(Grade)
         ) %>%
  mutate(Age=recode(Age,
                    `19` = ">18",
                    ),
         female=recode(female,
                      `1`= FALSE,
                      `2` = TRUE,
                      .default = NA,
                      .missing = NA),
         Grade=recode(Grade,
                      `13` = "Ungraded/Other"),
         Not_HL=recode(Not_HL,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)) %>%
  mutate_at(vars(starts_with("HL", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
  mutate_at(vars(starts_with("Race", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
    mutate_at(vars(starts_with("E", ignore.case = FALSE),
                   starts_with("C", ignore.case = FALSE)),
              list(~recode(.,
                           `1` = TRUE,
                           `2` = FALSE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
    mutate_at(vars(menthol:no_use),
              list(~recode(.,
                           `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)))

# sapply(nyts_data[["nyts2016"]], function(x)
#     summary(
#         factor(x)
#         )
#     )
```

```{r}
nyts_data[["nyts2017"]] <- nyts_data[["nyts2017"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  CBIDIS,
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  ) %>%
    rename(Age=Q1,
           female=Q2,
           Grade=Q3,
           Not_HL=Q4A,
           HL_Mex=Q4B,
           HL_PR=Q4C,
           HL_Cub=Q4D,
           HL_Other=Q4E,
           Race_AIAN=Q5A,
           Race_Asian=Q5B,
           Race_BAA=Q5C,
           Race_NHOPI=Q5D,
           Race_White=Q5E,
           female=Q2,
           menthol=Q50A,
           clove_spice=Q50B,
           fruit=Q50C,
           chocolate=Q50D,
           alcoholic_drink=Q50E,
           candy_dessert_sweets=Q50F,
           other=Q50G,
           no_use=Q50H) %>%
    mutate(Age = as.numeric(Age) + 8,
           Grade = as.numeric(Grade) + 5,
           brand_ecig=NA) %>%
  dplyr::select(-starts_with("Q"))

# sapply(nyts_data[["nyts2017"]], function(x)
#     summary(
#         factor(x)
#         )
#     )

nyts_data[["nyts2017"]] <- nyts_data[["nyts2017"]] %>%
  mutate_all(~ replace(., . %in% c("*", "**"), NA)) %>%
  mutate(Age=as.character(Age),
         Grade=as.character(Grade)
         ) %>%
  mutate(Age=recode(Age,
                    `19` = ">18",
                    ),
         female=recode(female,
                      `1`= FALSE,
                      `2` = TRUE,
                      .default = NA,
                      .missing = NA),
         Grade=recode(Grade,
                      `13` = "Ungraded/Other"),
         Not_HL=recode(Not_HL,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)) %>%
  mutate_at(vars(starts_with("HL", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
  mutate_at(vars(starts_with("Race", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
    mutate_at(vars(starts_with("E", ignore.case = FALSE),
                   starts_with("C", ignore.case = FALSE)),
              list(~recode(.,
                           `1` = TRUE,
                           `2` = FALSE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
    mutate_at(vars(menthol:no_use),
              list(~recode(.,
                           `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)))

# sapply(nyts_data[["nyts2017"]], function(x)
#     summary(
#         factor(x)
#         )
#     )
```

```{r}
nyts_data[["nyts2018"]] <- nyts_data[["nyts2018"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  ) %>%
    rename(Age=Q1,
           female=Q2,
           Grade=Q3,
           Not_HL=Q4A,
           HL_Mex=Q4B,
           HL_PR=Q4C,
           HL_Cub=Q4D,
           HL_Other=Q4E,
           Race_AIAN=Q5A,
           Race_Asian=Q5B,
           Race_BAA=Q5C,
           Race_NHOPI=Q5D,
           Race_White=Q5E,
           female=Q2,
           menthol=Q50A,
           clove_spice=Q50B,
           fruit=Q50C,
           chocolate=Q50D,
           alcoholic_drink=Q50E,
           candy_dessert_sweets=Q50F,
           other=Q50G,
           no_use=Q50H) %>%
    mutate(Age = as.numeric(Age) + 8,
           Grade = as.numeric(Grade) + 5,
           brand_ecig=NA) %>%
  dplyr::select(-starts_with("Q"))

# sapply(nyts_data[["nyts2018"]], function(x)
#     summary(
#         factor(x)
#         )
#     )

nyts_data[["nyts2018"]] <- nyts_data[["nyts2018"]] %>%
  mutate_all(~ replace(., . %in% c("*", "**"), NA)) %>%
  mutate(Age=as.character(Age),
         Grade=as.character(Grade)
         ) %>%
    mutate(Age=recode(Age,
                    `19` = ">18",
                    ),
         female=recode(female,
                      `1`= FALSE,
                      `2` = TRUE,
                      .default = NA,
                      .missing = NA),
         Grade=recode(Grade,
                      `13` = "Ungraded/Other"),
         Not_HL=recode(Not_HL,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)) %>%
  mutate_at(vars(starts_with("HL", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
  mutate_at(vars(starts_with("Race", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
  mutate_at(vars(starts_with("E", ignore.case = FALSE),
                   starts_with("C", ignore.case = FALSE)),
            list(~recode(.,
                         `1` = TRUE,
                         `2` = FALSE,
                         .missing = NA))) %>%
    mutate_at(vars(menthol:no_use),
              list(~recode(.,
                           `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)))

# sapply(nyts_data[["nyts2018"]], function(x)
#     summary(
#         factor(x)
#         )
#     )
```

```{r}
nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("Q4"), #Hispanic/Latino
                  starts_with("Q5"), #Race
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  -EHTP,
                  -CHTP,
                  Q40, #Brang, e-cigarettes
                  Q62A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q62B, #Clove or spice
                  Q62C, #Fruit 
                  Q62D, #Chocolate
                  Q62E, #Alcoholic Drink
                  Q62F, #Candy/Desserts/Other Sweets
                  Q62G, #Some Other Flavor Not Listed Here 
                  )  %>%
    rename(brand_ecig=Q40,
           Age=Q1,
           female=Q2,
           Grade=Q3,
           Not_HL=Q4A,
           HL_Mex=Q4B,
           HL_PR=Q4C,
           HL_Cub=Q4D,
           HL_Other=Q4E,
           Race_AIAN=Q5A,
           Race_Asian=Q5B,
           Race_BAA=Q5C,
           Race_NHOPI=Q5D,
           Race_White=Q5E,
           female=Q2,
           menthol=Q62A,
           clove_spice=Q62B,
           fruit=Q62C,
           chocolate=Q62D,
           alcoholic_drink=Q62E,
           candy_dessert_sweets=Q62F,
           other=Q62G) %>%
    mutate(Age = as.numeric(Age) + 8,
           Grade = as.numeric(Grade) + 5,
           no_use="missing") %>%
  dplyr::select(-starts_with("Q"))

# sapply(nyts_data[["nyts2019"]], function(x)
#     summary(
#         factor(x)
#         )
#     )

nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
  mutate_all(~ replace(., . %in% c(".N",".S",".Z"), NA)) %>%
  mutate(Age=as.character(Age),
         Grade=as.character(Grade)
         ) %>%
  mutate(psu=as.character(psu),
         Age=recode(Age,
                    `19` = ">18",
                    ),
         female=recode(female,
                      `1`= FALSE,
                      `2` = TRUE,
                      .default = NA),
         Grade=recode(Grade,
                      `13` = "Ungraded/Other"),
         Not_HL=recode(Not_HL,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE)) %>%
  mutate_at(vars(starts_with("HL", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
  mutate_at(vars(starts_with("Race", ignore.case = FALSE)),
              list(~recode(.,
                       `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) %>%
    mutate_at(vars(starts_with("E", ignore.case = FALSE),
                   starts_with("C", ignore.case = FALSE)),
              list(~recode(.,
                           `1` = TRUE,
                           `2` = FALSE,
                           .default = NA))) %>%
    mutate(brand_ecig = recode(brand_ecig,
                                             `1` = "Other", #levels 1,8 combined to `Other` 
                                             `2` = "Blu",
                                             `3` = "JUUL",
                                             `4` = "Logic",
                                             `5` = "MarkTen",
                                             `6` = "NJOY",
                                             `7` = "Vuse",
                                             `8` = "Other")) %>%
    mutate_at(vars(menthol:no_use),
              list(~recode(.,
                           `1` = TRUE,
                           .default = FALSE,
                           .missing =FALSE))) #Ask Michael about this if unclear

# sapply(nyts_data[["nyts2019"]], function(x)
#     summary(
#         factor(x)
#         )
#     )
```

```{r}
nyts_data <- nyts_data %>%
  map_df(bind_rows, .id = "year") %>%
  mutate(year=as.numeric(str_remove(year,"nyts")))

#sapply(nyts_data, class)

# sapply(nyts_data, function(x)
#     summary(
#         factor(x)
#         )
#     )


```

<style>
div.blue { background-color:#e6f0ff; border-radius: 5px; padding: 20px;}
</style>
<div class = "blue">

Reminder: Current users are a subset of ever users. 

</div>



## **Data Exploration and Wrangling**
*** 

We will also use the `%>%` pipe which can be used to define the input for later sequential steps. This will make more sense when we have multiple sequential steps using the same data object. To use the pipe notation we need to install and load dplyr as well.



## **Data Visualization**
*** 

### Question 1 

For lots of pivot_longer() that are the same... can use pivot_longer_spec()

spec <- relig_income %>% build_longer_spec(
  cols = -religion,
  names_to = "income",
  values_to = "count"
)
pivot_longer_spec(relig_income, spec)

But need to decide if we are going to create some new data frames...

```{r}
plot1 <- nyts_data %>%
    mutate(tobacco_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE),
           tobacco_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
                                    tobacco_sum_ever ==0 ~ FALSE),
           tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
                                    tobacco_sum_current ==0 ~ FALSE)) %>%
    group_by(year) %>%
    summarise(tobacco_ever_year=(sum(tobacco_ever, na.rm = TRUE)*100)/
                sum(!is.na(tobacco_ever)),
              tobacco_current_year=(sum(tobacco_current, na.rm = TRUE)*100)/
                sum(!is.na(tobacco_current))) %>%
    rename("Ever"=tobacco_ever_year,
           "Current"=tobacco_current_year) %>%
  pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students")%>%
    ggplot(aes(x=year,y=`Percentage of students`, linetype=User)) +
    geom_line() + 
  geom_point(show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
    scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does nicotine use vary over the years?",
         subtitle = "Current and ever users of nicotine products",
         y = "% of students")

plot1 
```

### Question 2

```{r}
plot2 <- nyts_data %>%
    group_by(year,
             female) %>%
    summarise(EELCIGT_year=(sum(EELCIGT, na.rm = TRUE)*100)/
                sum(!is.na(EELCIGT)),
              CELCIGT_year=(sum(CELCIGT, na.rm = TRUE)*100)/
                sum(!is.na(CELCIGT))) %>% 
    filter(!is.na(female)) %>%
    rename("E-cigarettes, Ever"=EELCIGT_year,
           "E-cigarettes, Current"=CELCIGT_year) %>%
  #converting all columns between and including `E-cigarettes, Ever` and `E-cigarettes, Current` into one column called category
    pivot_longer(cols = `E-cigarettes, Ever`:`E-cigarettes, Current`, names_to = "Category", values_to = "Percentage of students")%>%
    mutate(User = case_when(Category == "E-cigarettes, Ever" ~ "Ever",
                               Category == "E-cigarettes, Current" ~ "Current")) %>%
    mutate(Sex = case_when(female == TRUE ~ "Females",
                               female == FALSE ~ "Males")) %>%
    ggplot(aes(x=year,y=`Percentage of students`, color=Sex, linetype=User)) +
    geom_line() + 
  geom_point(show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.text.x = element_text(angle = 90),
          axis.title.x = element_blank()) +
    labs(title = "How do vaping rates compare between males and females?",
         subtitle = "Current and ever users by gender",
         y = "% of students")

plot2
```

### Question 3

What vaping brands and flavors appear to be used the most frequently?

```{r, echo=FALSE, fig.cap="Huang J, Duan Z, Kwok J, et al. Tob Control 2019;28:146–151.", out.width = '100%'}
knitr::include_graphics(here::here("img", "HuangJ_DuanZ_KwokJ_et_al_TobaccoControl_Figure1.png"))
```

[Paper](https://tobaccocontrol.bmj.com/content/tobaccocontrol/28/2/146.full.pdf)


```{r}
plot3 <- nyts_data %>%
    filter(year==2019) %>%
    group_by(brand_ecig) %>%
    filter(!is.na(brand_ecig)) %>%
    summarise(n = n()) %>%
    mutate(total = sum(n),
           Percent = n*100/total) %>%
    mutate(brand_ecig = fct_reorder(brand_ecig, desc(Percent))) %>%
    ggplot(aes(x=brand_ecig,y=Percent, fill=brand_ecig)) +
    geom_bar(stat="identity") +
    theme_minimal() +
    theme(legend.position = "none",
          axis.text.x = element_text(size=10),
          axis.title.x = element_blank()) +
    labs(title = "What vaping brands appear to be used the most frequently?",
         subtitle = "Brand of e-cigarette most frequently used in the last 30 days (2019)",
         y = "% of e-cigarette users responding")

plot3
```

```{r}
plot4 <- nyts_data %>%
  filter(year!=2015) %>%
  group_by(year) %>%
  summarise(Menthol=(sum(menthol, na.rm = TRUE)*100)/
                sum(!is.na(menthol)),
              `Clove or Spice`=(sum(clove_spice, na.rm = TRUE)*100)/
                sum(!is.na(clove_spice)),
              `Fruit`=(sum(fruit, na.rm = TRUE)*100)/
                sum(!is.na(fruit)),
              `Chocolate`=(sum(chocolate, na.rm = TRUE)*100)/
                sum(!is.na(chocolate)),
              `Alcoholic Drink`=(sum(alcoholic_drink, na.rm = TRUE)*100)/
                sum(!is.na(alcoholic_drink)),
              `Candy/Desserts/Sweets`=(sum(candy_dessert_sweets, na.rm = TRUE)*100)/sum(!is.na(candy_dessert_sweets))) %>%
  pivot_longer(cols = -year, names_to = "Percentage of students", values_to = "Flavor") %>%

  ggplot(aes(x=year, y=`Percentage of students`, color=Flavor)) +
  geom_line() +
  geom_point(show.legend = FALSE) +
  theme_minimal() +
  theme(legend.position = "bottom",
          axis.text.x = element_text(angle = 90),
          axis.title.x = element_blank()) + 
  labs(title = "What flavors appear to be used the most frequently in nicotine products?",
       subtitle = "Flavors of tobacco products used in the past 30 days")

plot4 
```

```{r}
plot5 <- nyts_data %>%
  filter(year!=2015) %>%
  filter(menthol==TRUE|
           clove_spice==TRUE|
           fruit==TRUE|
           chocolate==TRUE|
           alcoholic_drink==TRUE|
           candy_dessert_sweets==TRUE|
           other==TRUE) %>%
  mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  mutate(ecig_only_ever = case_when(ecig_ever == TRUE &
                                      non_ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE),
           ecig_only_current = case_when(ecig_current == TRUE &
                                           non_ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE),
           non_ecig_only_ever = case_when(non_ecig_ever == TRUE &
                                            ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE),
           non_ecig_only_current = case_when(non_ecig_current == TRUE &
                                               ecig_ever ==FALSE ~ TRUE,
                                    TRUE ~ FALSE)) %>%
  mutate(Group = case_when(ecig_only_ever==TRUE |
                             ecig_only_current==TRUE ~ "Only e-cigarettes",
                         non_ecig_only_ever==TRUE |
                           non_ecig_only_current==TRUE ~ "Only other products",
                                    TRUE ~ "Both")) %>%
  filter(Group!="Both") %>%
  group_by(year, Group) %>%
  summarise(`Menthol`=(sum(menthol, na.rm = TRUE)*100)/
              sum(!is.na(menthol)),
              `Clove or Spice`=(sum(clove_spice, na.rm = TRUE)*100)/
              sum(!is.na(clove_spice)),
              `Fruit`=(sum(fruit, na.rm = TRUE)*100)/sum(!is.na(fruit)),
              `Chocolate`=(sum(chocolate, na.rm = TRUE)*100)/
              sum(!is.na(chocolate)),
              `Alcoholic Drink`=(sum(alcoholic_drink, na.rm = TRUE)*100)/
              sum(!is.na(alcoholic_drink)),
              `Candy/Desserts/Sweets`=(sum(candy_dessert_sweets, na.rm = TRUE)*100)/
              sum(!is.na(candy_dessert_sweets)),
            `Other`=(sum(other, na.rm = TRUE)*100)/
              sum(!is.na(other)),
            Respondents=n()) %>%
  #converting all columns between and including Menthol and Other to one column called Flavor
  pivot_longer(cols = Menthol:Other, names_to = "Flavor", values_to = "Percentage of students") %>%
  filter(!is.na(`Percentage of students`),
         Flavor!="Other") %>%
  group_by(Flavor) %>%
  mutate(affirmative=(Respondents * `Percentage of students`)/100) %>%
  mutate(flavor_mean = sum(affirmative)/sum(Respondents)) %>%
  ungroup() %>%
  mutate(flavor_mean_rank = dense_rank(flavor_mean),
         Flavor = fct_reorder(Flavor, flavor_mean_rank)) %>%
  ggplot(aes(x=year, y=`Percentage of students`, color=Group)) +
  facet_wrap(.~Flavor,ncol=3) +
  geom_line() + 
  geom_point(show.legend = FALSE) + 
  theme_minimal() +
  theme(legend.position = "bottom",
          axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90)) + 
  labs(title = "Among users of only one type of product, what vaping flavors appear to be used the most frequently?",
       subtitle = "Percent reporting only e-cigarette use vs only other nicotine product use")

plot5
```

### Question 4

Have vaping rates possibly influenced tobacco/nicotine use?

```{r}
plot6 <- nyts_data %>%
    group_by(year) %>%
    summarise(ECIGT_year=(sum(ECIGT, na.rm = TRUE)*100)/
                sum(!is.na(ECIGT)),
              EELCIGT_year=(sum(EELCIGT, na.rm = TRUE)*100)/
                sum(!is.na(EELCIGT)),
              CCIGT_year=(sum(CCIGT, na.rm = TRUE)*100)/
                sum(!is.na(CCIGT)),
              CELCIGT_year=(sum(CELCIGT, na.rm = TRUE)*100)/
                sum(!is.na(CELCIGT))) %>%
    rename("Cigarettes, Ever"=ECIGT_year,
           "E-cigarettes, Ever"=EELCIGT_year,
           "Cigarettes, Current"=CCIGT_year,
           "E-cigarettes, Current"=CELCIGT_year) %>%
  pivot_longer(cols= -year, names_to = "Category", values_to = "Percentage of students")%>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(User = case_when(Category == "Cigarettes, Ever" ~ "Ever",
                               Category == "E-cigarettes, Ever" ~ "Ever",
                               Category == "Cigarettes, Current" ~ "Current",
                               Category == "E-cigarettes, Current" ~ "Current")) %>%
    mutate(Product = case_when(Category == "Cigarettes, Ever" ~ "Cigarettes",
                               Category == "E-cigarettes, Ever" ~ "E-cigarettes",
                               Category == "Cigarettes, Current" ~ "Cigarettes",
                               Category == "E-cigarettes, Current" ~ "E-cigarettes")) %>%
    ggplot(aes(x=year,y=`Percentage of students`, color=Product, linetype=User)) +
    geom_line() + 
  geom_point(show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does e-cigarette use compare to cigarette use?",
         subtitle = "Current and ever users of e-cigarettes and cigarettes",
         y = "% of students")

plot6
```

```{r}
plot7 <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
    group_by(year) %>%
    summarise(ecig_ever_year=(sum(ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(ecig_ever)),
              ecig_current_year=(sum(ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(ecig_current)),
              non_ecig_ever_year=(sum(non_ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_ever)),
              non_ecig_current_year=(sum(non_ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_current))) %>%
    pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    mutate(User = case_when(Category =="ecig_ever_year" ~ "Ever",
                           Category =="non_ecig_ever_year" ~ "Ever",
                           Category =="ecig_current_year" ~ "Current",
                           Category =="non_ecig_current_year" ~ "Current")) %>%
    mutate(Product = case_when(Category =="ecig_ever_year" ~ "E-cigarettes",
                           Category =="non_ecig_ever_year" ~ "Other products",
                           Category =="ecig_current_year" ~ "E-cigarettes",
                           Category =="non_ecig_current_year" ~ "Other products")) %>%
    filter(User=="Ever") %>%
    ggplot(aes(x=year,y=`Percentage of students`, color=Product)) +
    geom_line(linetype=1) + # geom_bar(stat="identity", position = "dodge", color="black") +
  geom_point(show.legend = FALSE) +
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does e-cigarette ever use compare to ever use of other products over the years?",
         subtitle = "E-cigarette and non-e-cigarette use",
         y = "% of students")

plot7
```

```{r}
plot8 <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
    group_by(year) %>%
    summarise(ecig_ever_year=(sum(ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(ecig_ever)),
              ecig_current_year=(sum(ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(ecig_current)),
              non_ecig_ever_year=(sum(non_ecig_ever, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_ever)),
              non_ecig_current_year=(sum(non_ecig_current, na.rm = TRUE)*100)/
                sum(!is.na(non_ecig_current))) %>%
    pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(User = case_when(Category =="ecig_ever_year" ~ "Ever",
                           Category =="non_ecig_ever_year" ~ "Ever",
                           Category =="ecig_current_year" ~ "Current",
                           Category =="non_ecig_current_year" ~ "Current")) %>%
    mutate(Product = case_when(Category =="ecig_ever_year" ~ "E-cigarettes",
                           Category =="non_ecig_ever_year" ~ "Other products",
                           Category =="ecig_current_year" ~ "E-cigarettes",
                           Category =="non_ecig_current_year" ~ "Other products")) %>%
    ggplot(aes(x=year,y=`Percentage of students`, color=Product, linetype=User)) +
    geom_line() +
  geom_point(show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_minimal() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does e-cigarette use compare to use of other products over the years?",
         subtitle = "E-cigarette and non-e-cigarette use",
         y = "% of students")

plot8
```

### Unweighted Sample

```{r, fig.height=10}
plotA_uw <- plot1 +
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
    labs(title = "Nicotine product users more prevalent after 2017",
         subtitle = NULL,
         y = "% of students")

plotB_uw <- plot7 + 
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
    labs(title = "% Ever trying e-cigarettes increases &\n% ever trying other products decreases",
         subtitle = NULL,
         y = "% of students")

plotC_uw <- plot8 + 
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
    labs(title = "% Using e-cigarettes increases &\n% using Other products decreases",
         subtitle = NULL,
         y = "% of students")

title_uw <- ggdraw() + 
  draw_label(
    "Have vaping rates possibly influenced tobacco/nicotine use?",
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_uw <- plot_grid(plotA_uw,
                     rel_widths = c(1,1))
plotsBC_uw <- plot_grid(plotB_uw,
                        plotC_uw,
                        rel_widths = c(1,1))

legend_uw <- get_legend(plotB_uw +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_uw <- plot_grid(title_uw,
                       plotsA_uw,
                       plotsBC_uw,
                       legend_uw,
                       ncol = 1,
                       rel_heights = c(0.1,
                                       1,
                                       1,
                                       0.1),
                       scale = 1.0)

figure_uw
```

### Weighted Sample

```{r}
plotA_w <- nyts_data %>%
    mutate(tobacco_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE),
           tobacco_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
                                    tobacco_sum_ever ==0 ~ FALSE),
           tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
                                    tobacco_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
  summarise(tobacco_ever_year = survey_mean(tobacco_ever, vartype = "ci", na.rm=TRUE),
            tobacco_current_year = survey_mean(tobacco_current, vartype = "ci", na.rm=TRUE))  %>%
  mutate_at(vars(-year), "*", 100) %>%
    pivot_longer(cols = -year, names_to = "Type", values_to = "Percentage of students") %>%
    # gather(key=Type,
    #        value=`Percentage of students`,
    #        -year) %>%
  mutate(Estimate = case_when(grepl("_low", Type) ~ "Lower",
                          grepl("_upp", Type) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Type) ~ "Ever",
                          grepl("current", Type) ~ "Current",
                          TRUE ~ "Mean")) %>%
  dplyr::select(-Type) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  ggplot(aes(x=year,y=Mean)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
    scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "Nicotine product users more prevalent after 2017",
         y = "% of students")

plotB_w <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
    summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
  pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year)  %>%
  mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("current", Category) ~ "Current",
                          TRUE ~ "Ever",),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  filter(User=="Ever") %>%
  dplyr::rename("Lower_temp" = Upper,
                "Upper_temp" = Lower) %>%
  dplyr::rename("Lower"=Lower_temp,
                "Upper"=Upper_temp) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(linetype=1) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), show.legend = FALSE) +
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% ever trying e-cigarettes increases &\n% ever trying other products decreases",
         y = "% of students")

#### the wrangling looks the same as the above plot...
plotC_w <- nyts_data %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
  summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
  pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% Using e-cigarettes increases &\n% using Other products decreases",
         y = "% of students")

title_w <- ggdraw() + 
  draw_label(
    expression("Have vaping rates possibly influenced tobacco/nicotine use?"),
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_w <- plot_grid(plotA_w,
                     rel_widths = c(1),
                     align = "v",
                     axis = "bt")
plotsBC_w <- plot_grid(plotB_w,
                     plotC_w,
                     rel_widths = c(1,1),
                     align = "v",
                     axis = "bt")

legend_w <- get_legend(plotB_w +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_w <- plot_grid(title_w,
                      plotsA_w,
                      plotsBC_w,
                      legend_w,
                      ncol = 1,
                      rel_heights = c(0.1,
                                      1,
                                      1,
                                      0.1),
                      scale = 1.0)

figure_w
```

### Hypothethical Cohort

```{r}
plotA_w_8 <- nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
    mutate(tobacco_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE),
           tobacco_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
                                    tobacco_sum_ever ==0 ~ FALSE),
           tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
                                    tobacco_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt) %>%
    group_by(year) %>%
  summarise(tobacco_ever_year = survey_mean(tobacco_ever, vartype = "ci", na.rm=TRUE),
            tobacco_current_year = survey_mean(tobacco_current, vartype = "ci", na.rm=TRUE))  %>%
  mutate_at(vars(-year), "*", 100) %>%
    pivot_longer(cols = -year, names_to = "Type", values_to = "Percentage of students")%>%
    # gather(key=Type,
    #        value=`Percentage of students`,
    #        -year) %>%
  mutate(Estimate = case_when(grepl("_low", Type) ~ "Lower",
                          grepl("_upp", Type) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Type) ~ "Ever",
                          grepl("current", Type) ~ "Current",
                          TRUE ~ "Mean")) %>%
  dplyr::select(-Type) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  ggplot(aes(x=year,y=Mean)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper)) + 
  scale_linetype_manual(values = c(2,1)) +
    scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "Nicotine product users becoming increasingly prevalent",
         y = "% of students")

plotB_w_8 <- nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt) %>%
    group_by(year) %>%
    summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
   pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students")%>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year)  %>%
  mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("current", Category) ~ "Current",
                          TRUE ~ "Ever",),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
  filter(User=="Ever") %>%
  dplyr::rename("Lower_temp" = Upper,
                "Upper_temp" = Lower) %>%
  dplyr::rename("Lower"=Lower_temp,
                "Upper"=Upper_temp) %>%
  ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(linetype=1) +
  geom_linerange(aes(ymin = Lower, ymax = Upper)) + 
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% ever trying nicotine products increases",
         y = "% of students")

plotC_w_8 <- nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
    mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever ==0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current ==0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever ==0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current ==0 ~ FALSE)) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt) %>%
    group_by(year) %>%
  summarise(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_at(vars(-year), "*", 100) %>%
  dplyr::select(year,
                ecig_ever_year,
                ecig_current_year,
                non_ecig_ever_year,
                non_ecig_current_year,
                contains("low"),
                contains("upp")) %>%
  pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students")%>%
    # gather(key=Category,
    #        value=`Percentage of students`,
    #        -year) %>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  #spread(Estimate, `Percentage of students`) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper)) + 
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "E-cigarette use surpasses use of other nicotine products",
         y = "% of students")

title_w_8 <- ggdraw() + 
  draw_label(
    expression("Among"~8^th~"graders in 2015, have vaping rates possibly influenced tobacco/nicotine use?"),
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_w_8 <- plot_grid(plotA_w_8,
                        rel_widths = c(1),
                        align = "v",
                        axis = "bt")

plotsBC_w_8 <- plot_grid(plotB_w_8,
                         plotC_w_8,
                         rel_widths = c(1,1),
                         axis = "bt")

legend_w_8 <- get_legend(plotB_w_8 +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_w_8 <- plot_grid(title_w_8,
                        plotsA_w_8,
                        plotsBC_w_8,
                        legend_w_8,
                        ncol = 1,
                        rel_heights = c(0.1,
                                      1,
                                      1,
                                      0.1),
                        scale = 1.0
)

figure_w_8
```

### Final Figure

```{r}
title_final <- ggdraw() +
  draw_label(
    expression("Have vaping rates possibly influenced tobacco/nicotine use?"),
    fontface = 'bold',
    size=16,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_uw_final <- ggdraw() + 
  draw_label(
    expression(~6^th~"-"~12^th~"graders, unweighted"),
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_w_final <- ggdraw() + 
  draw_label(
    expression(~6^th~"-"~12^th~"graders, weighted"),
    fontface = 'bold',
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_w_8_final <- ggdraw() + 
  draw_label(
    expression(~8^th~"graders in 2015, weighted"),
    fontface = 'bold',
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_final <- plot_grid(subtitle_uw_final,
                            subtitle_w_final,
                            subtitle_w_8_final,
                            ncol = 3)

plotsA_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of e-cigarette use by user type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_final <- plot_grid(plotA_uw + theme(plot.title = element_blank()),
                          plotA_w + theme(plot.title = element_blank()),
                          plotA_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

plotsB_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of ever use by product type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsB_final <- plot_grid(plotB_uw + theme(plot.title = element_blank()),
                          plotB_w + theme(plot.title = element_blank()),
                          plotB_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

plotsC_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of nicotine product use by product & user type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsC_final <- plot_grid(plotC_uw + theme(plot.title = element_blank()),
                          plotC_w + theme(plot.title = element_blank()),
                          plotC_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

legend_final <- get_legend(plotB_w +
                             theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

final_plot <- plot_grid(title_final,
          plotsA_title_final,
          subtitle_final,
          plotsA_final,
          plotsB_title_final,
          subtitle_final,
          plotsB_final,
          plotsC_title_final,
          subtitle_final,
          plotsC_final,
          legend_final,
          ncol = 1,
          rel_heights = c(0.2,
                          0.2,
                          0.1,
                          1,
                          0.2,
                          0.1,
                          1,
                          0.2,
                          0.1,
                          1,
                          0.1))

final_plot
```

```{r, echo=FALSE, include=FALSE}
ggsave("final_plot.png")
```

## **Suggested Homework**
*** 

<style>
div.blue { background-color:#e6f0ff; border-radius: 5px; padding: 20px;}
</style>
<div class = "blue">

+ Apply survey weights to one of the figures produced in this case study in which weighted estimates were not produced. Include error bars in the updated figure.
    + Does the figure change after the application of survey weights?
    + If so, describe how. 
+ Reproduce `final_plot` above for a different cohort of your choice.

</div>

### Notes

Ever and current variables are limited to those shared by all years of data included in this case study.

```{r, echo=FALSE}
knit_time_end <- Sys.time()
```

```{r, echo=FALSE}
knit_time <- knit_time_end - knit_time_start
knit_time_message <- paste("Knit time:",
      round(as.numeric(knit_time) ,3),
      units(knit_time)
      )
```

<p style="color:red">New code: `r knit_time_message`. Previous code: ~ 3 m. </p>

### Problems

I had difficulty producing a plot that succinctly presented a trend. It's very easy to produce plots that are very useful once one is familiar with the data. Some plots, however, cannot stand alone and need additional context to be clear for those without prior knowledge of the data. When I first shared a plot I had been working on with others, it became clear that in my effort to present a complicated idea briefly I had left out information that would make the trend easily interpretable. To solve this issue, I began to present visualizations of the data alongside my original plot. The final figure I created contained several additional plots, each presenting the same trend at a different level than my initial plot.

My "centerpiece" plot is the middle plot in `final_plot`. The 8 plots around it help provide a very clear picture of what is going on in the US with regards to e-cigarette use and nicotine product use at large. On its own, it's difficult to understand the trends in the US and how important the weighting scheme is for inference. Once you add the left and right columns, it's clear what is going on. 



## **Data Analysis**
*** 

### **content header**
*** 



## **Summary**
*** 



## **Helpful Links**
*** 

review of [tidymodels](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/){target="_blank"} 

guide for [preprocessing with recipes](http://www.rebeccabarter.com/blog/2019-06-06_pre_processing/)


## **Session info**
***

```{r}
sessionInfo()
```
